k-means clustering

Given data points 𝐚1,,𝐚nd\mathbf{a}_1,…,\mathbf{a}_n \in \mathbb{R}^d, find centers 𝛍1,...,𝛍kd\mathbf{\mu}_1,...,\mathbf{\mu}_k \in \mathbb{R}^d to minimize: Cost(𝛍1,...,𝛍k)=i=1nminj=1,...,k||𝛍j𝐚i||22Cost(\mathbf{\mu}_1,...,\mathbf{\mu}_k)=\sum_{i=1}^n \min_{j=1,...,k} ||\mathbf{\mu}_j-\mathbf{a}_i||_2^2

Equivalent form: find clusters C1,...,Ck{1,...,n}C_1,...,C_k \subseteq \{1,...,n\} to minimize: Cost(C1,...,Ck)=i=1n12|Cj|u,vCj||𝐚u𝐚v||22Cost(C_1,...,C_k)=\sum_{i=1}^n \frac{1}{2|C_j|}\sum_{u,v \in C_j}||\mathbf{a}_u-\mathbf{a}_v||_2^2 Approximation scheme:

#incomplete